3445: AVRO-3520: [java] expose read schema in custom encoding #19

Open
martin-augment wants to merge 2 commits into main from pr-3445-2025-11-07-12-49-40

Conversation

@martin-augment

@martin-augment martin-augment commented Nov 7, 2025

3445: To review by AI


Note

Plumbs read schema into custom encoders and field access so encoders can adapt to schema changes, with tests validating schema-dependent read/write behavior.

  • Reflect/Custom Encoding:
    • Add CustomEncoding.withSchema(Schema) to provide the active schema to encoders.
    • Update ReflectData to create schema-specific field accessors and to initialize top-level custom encoders via withSchema(schema); introduce buildByName to bind accessors per schema.
    • Change FieldAccess#getAccessor to accept Schema and propagate field-level schema in FieldAccessReflect, wiring custom encoders with withSchema.
    • Adjust ReflectionUtil validation to use schema-aware accessors.
  • Tests:
    • Add TestCustomEncoderSchemaChanges covering top-level and nested custom encoders adapting to differing schemas (string vs int and field count).

Written by Cursor Bugbot for commit f378f08.
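The withSchema hook and the "return a new instance" shape the summary describes can be sketched as follows. This is a minimal stand-alone illustration, not Avro code: Schema is stood in by a plain String so the snippet compiles without Avro on the classpath, and WidthEncoding is a hypothetical encoder.

```java
// Stand-in for org.apache.avro.reflect.CustomEncoding; "schema" is a
// plain String here instead of org.apache.avro.Schema.
abstract class CustomEncoding<T> {
    // Default: stateless encoders ignore the schema and reuse themselves.
    CustomEncoding<T> withSchema(String schema) {
        return this;
    }

    abstract String write(T datum);
}

// Hypothetical schema-aware encoder that adapts its output to the schema.
class WidthEncoding extends CustomEncoding<Integer> {
    private final String schema;

    WidthEncoding(String schema) {
        this.schema = schema;
    }

    @Override
    CustomEncoding<Integer> withSchema(String schema) {
        // Return a NEW immutable instance rather than mutating this one,
        // so a cached encoder can safely be shared across threads.
        return new WidthEncoding(schema);
    }

    @Override
    String write(Integer datum) {
        return schema + ":" + datum;
    }
}

public class WithSchemaDemo {
    public static void main(String[] args) {
        CustomEncoding<Integer> base = new WidthEncoding("v1");
        CustomEncoding<Integer> adapted = base.withSchema("v2");
        System.out.println(base.write(7));     // v1:7
        System.out.println(adapted.write(7));  // v2:7
    }
}
```

The base-class default returns this, so existing stateless encoders are unaffected; schema-aware encoders return a fresh instance carrying the new schema.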

@github-actions github-actions bot added the Java label Nov 7, 2025
@coderabbitai

coderabbitai bot commented Nov 7, 2025

Walkthrough

Adds schema-aware hooks to Avro's custom encoding reflection framework. Introduces a withSchema method to CustomEncoding, updates FieldAccess to propagate Schema parameters through the accessor construction chain, and implements schema-specific accessor mappings with comprehensive test coverage.

Changes

  • Schema-aware reflection API enhancements (lang/java/avro/src/main/java/org/apache/avro/reflect/CustomEncoding.java, FieldAccess.java): adds a no-op withSchema(Schema schema) method to CustomEncoding as an implementation-level hook for reacting to the schema, and changes the FieldAccess abstract method signature from getAccessor(Field) to getAccessor(Field, Schema) so schema information propagates through accessor construction.
  • Reflection implementation updates (lang/java/avro/src/main/java/org/apache/avro/reflect/FieldAccessReflect.java, ReflectData.java, ReflectionUtil.java): implements schema propagation in reflection-based accessor construction. FieldAccessReflect extends getAccessor to call withSchema(schema) on custom encoding instances; ReflectData introduces a buildByName(Class, Schema) builder and schema-specific accessor mappings; ReflectionUtil updates private test helper methods to accept and propagate a Schema parameter.
  • Schema-aware custom encoder test suite (lang/java/avro/src/test/java/org/apache/avro/reflect/TestCustomEncoderSchemaChanges.java): adds test coverage with multiple schema variants (SCHEMA1–3, SCHEMA_INT1–3), custom encoder implementations (R1Encoder, R1EncoderDouble), test data classes (R1Wrapper, R1), and cases validating schema-aware encoding in top-level and nested scenarios.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6398a1f and f378f08.

📒 Files selected for processing (6)
  • lang/java/avro/src/main/java/org/apache/avro/reflect/CustomEncoding.java (1 hunks)
  • lang/java/avro/src/main/java/org/apache/avro/reflect/FieldAccess.java (2 hunks)
  • lang/java/avro/src/main/java/org/apache/avro/reflect/FieldAccessReflect.java (2 hunks)
  • lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java (4 hunks)
  • lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectionUtil.java (2 hunks)
  • lang/java/avro/src/test/java/org/apache/avro/reflect/TestCustomEncoderSchemaChanges.java (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
lang/java/avro/src/main/java/org/apache/avro/reflect/FieldAccessReflect.java (1)
lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectionUtil.java (1)
  • ReflectionUtil (46-209)
lang/java/avro/src/test/java/org/apache/avro/reflect/TestCustomEncoderSchemaChanges.java (2)
lang/java/avro/src/main/java/org/apache/avro/io/DecoderFactory.java (1)
  • DecoderFactory (39-310)
lang/java/avro/src/main/java/org/apache/avro/io/EncoderFactory.java (1)
  • EncoderFactory (43-425)
lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java (1)
lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectionUtil.java (1)
  • ReflectionUtil (46-209)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: claude-review
  • GitHub Check: codex
🔇 Additional comments (4)
lang/java/avro/src/main/java/org/apache/avro/reflect/FieldAccessReflect.java (1)

31-37: Schema propagation hook is wired correctly.

Calling withSchema(schema) prior to building the accessor gives custom encoders the read schema context without altering the legacy path—nice containment of the new behaviour.

lang/java/avro/src/main/java/org/apache/avro/reflect/CustomEncoding.java (1)

51-61: New extension point is documented well.

The withSchema hook cleanly exposes the reader schema to implementers while keeping existing encoders untouched—good balance between flexibility and compatibility.

lang/java/avro/src/main/java/org/apache/avro/reflect/ReflectData.java (1)

391-427: Schema-aware accessor map is a solid refactor.

Rebuilding the by-name map per schema and threading that through getAccessor ensures field-level encoders see the correct schema while keeping the cache semantics intact—nicely done.
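The cache-plus-rebuild shape this comment describes can be sketched as follows. This is a hand-written illustration with String and String[] standing in for Avro's Schema and FieldAccessor types; in ReflectData the key is the Schema instance itself, so WeakHashMap entries can be collected once a schema becomes unreachable (with literal String keys, as here, interning prevents eviction).

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical per-schema cache: accessors are built once per schema
// and reused on subsequent lookups.
class AccessorCache {
    private final Map<String, String[]> bySchema = new WeakHashMap<>();

    // computeIfAbsent gives the build-once-per-schema behaviour;
    // synchronized because WeakHashMap is not thread-safe.
    synchronized String[] accessorsFor(String schema) {
        return bySchema.computeIfAbsent(schema, this::build);
    }

    private String[] build(String schema) {
        // stand-in for the reflective buildByName(clazz, schema) work
        return new String[] { schema + "#field0", schema + "#field1" };
    }
}

public class CacheDemo {
    public static void main(String[] args) {
        AccessorCache cache = new AccessorCache();
        String[] first = cache.accessorsFor("schemaA");
        // same schema key -> same cached array, no rebuild
        System.out.println(first == cache.accessorsFor("schemaA")); // true
    }
}
```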

lang/java/avro/src/test/java/org/apache/avro/reflect/TestCustomEncoderSchemaChanges.java (1)

68-300: Great coverage for schema evolution paths.

These round-trip and manual-read scenarios across the varying schemas give strong confidence that schema-aware encoders behave correctly in both top-level and nested contexts.




Bug: Schema-less accessors break custom encoding evolution

The getAccessorFor method returns accessors from the byName map which are created without schema information in the constructor. When custom encodings are accessed through getFieldAccessor, their withSchema method is never called, causing them to use default schemas instead of the actual field schemas. This breaks schema evolution for custom encodings accessed by field name.


@claude

claude bot commented Nov 7, 2025

Code Review: AVRO-3520 - Expose read schema in custom encoding

Summary

This PR successfully implements a mechanism to expose the read schema to custom encoders, enabling them to adapt to schema evolution. The implementation is solid with good test coverage. Below are detailed findings:


Strengths

  1. Good Design Pattern: The withSchema(Schema) method follows the factory pattern nicely, allowing custom encoders to return schema-specific instances while maintaining backward compatibility with a sensible default implementation.

  2. Comprehensive Test Coverage: TestCustomEncoderSchemaChanges.java provides excellent coverage with 304 lines of tests covering:

    • Top-level custom encoders with varying field counts (1, 2, 3 fields)
    • Nested custom encoders within wrapper classes
    • Type changes (string vs int)
    • Both read and write operations
  3. Backward Compatibility: The default implementation of withSchema() returns this, ensuring existing code without overrides continues to work.

  4. Consistent Application: The schema is threaded through both:

    • Top-level encoders (ReflectData.java:1080)
    • Field-level encoders (FieldAccessReflect.java:36)

⚠️ Issues & Concerns

1. Potential Memory/Performance Issue in ClassAccessorData.createAccessorsFor

Location: ReflectData.java:391-399

private FieldAccessor[] createAccessorsFor(Schema schema) {
    var byNameSchema = buildByName(clazz, schema);  // Creates NEW map of accessors
    List<Schema.Field> avroFields = schema.getFields();
    FieldAccessor[] result = new FieldAccessor[avroFields.size()];
    for (Schema.Field avroField : schema.getFields()) {
        result[avroField.pos()] = byNameSchema.get(avroField.name());
    }
    return result;
}

Problem: Every time createAccessorsFor is called for a new schema, buildByName creates a complete new set of FieldAccessor instances for all fields in the class, even if only one field has a custom encoder that needs schema information.

Impact:

  • Memory overhead: For classes with many fields, this creates N×M accessors (N fields × M different schemas)
  • Performance: Unnecessary object allocation and reflection calls
  • Cache pollution: The WeakHashMap<Schema, FieldAccessor[]> in bySchema will accumulate entries for each unique schema instance

Recommendation:
Consider reusing accessors for fields that don't have custom encoders. Only fields with @AvroEncode annotations need schema-specific instances:

private FieldAccessor[] createAccessorsFor(Schema schema) {
    List<Schema.Field> avroFields = schema.getFields();
    FieldAccessor[] result = new FieldAccessor[avroFields.size()];
    for (Schema.Field avroField : schema.getFields()) {
        String fieldName = avroField.name();
        FieldAccessor baseAccessor = byName.get(fieldName);
        
        // Only create schema-specific accessor if field has custom encoding
        if (needsSchemaAwareAccessor(baseAccessor)) {
            Field javaField = getFieldForName(clazz, fieldName);
            result[avroField.pos()] = ReflectionUtil.getFieldAccess()
                .getAccessor(javaField, avroField.schema());
        } else {
            result[avroField.pos()] = baseAccessor;
        }
    }
    return result;
}

2. Thread Safety Concern in Custom Encoder Implementations

Location: CustomEncoding.java:59-61

public CustomEncoding<T> withSchema(Schema schema) {
    return this;
}

Problem: The test implementations (R1Encoder, R1EncoderDouble) mutate the schema field:

@Override
public CustomEncoding<R1> withSchema(Schema schema) {
    var encoder = new R1Encoder();
    encoder.schema = schema;  // Direct field mutation
    return encoder;
}

While the test creates new instances (good!), the Javadoc doesn't warn users about thread-safety requirements.

Recommendation: Add documentation to CustomEncoding.withSchema:

/**
 * Receives the schema, giving the concrete encoder implementation an
 * opportunity to detect schema changes and behave accordingly. Useful for
 * maintaining backwards compatibility.
 *
 * <p>Important: Implementations should return a NEW instance if they need
 * to store schema-specific state. The returned encoder may be cached and
 * used concurrently, so it must be thread-safe.
 *
 * @param schema the schema detected during read.
 * @return custom encoding to be used (may be a new instance or this if stateless)
 */

3. Null Schema Handling

Location: Multiple places where schema parameter is passed

The code passes null schemas in several places:

  • ClassAccessorData constructor: buildByName(c, null) (line 372)
  • buildByName: fieldSchema = field != null ? field.schema() : null (line 421)

Problem: withSchema(null) is never validated. If a custom encoder doesn't handle null schemas, it could throw NPE.

Recommendation:

  1. Document the null case in CustomEncoding.withSchema Javadoc
  2. Consider defensive null checks in the custom encoder base implementation
  3. Or avoid calling withSchema when schema is null
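Options 2 and 3 above can be combined in the base implementation. A stand-alone sketch, again with a plain String standing in for org.apache.avro.Schema and hypothetical class names:

```java
// Hypothetical base class with a null-tolerant withSchema default.
abstract class NullSafeEncoding<T> {
    NullSafeEncoding<T> withSchema(String schema) {
        if (schema == null) {
            // No schema information available: keep the current encoder
            // instead of passing null down to subclasses.
            return this;
        }
        return forSchema(schema);
    }

    abstract NullSafeEncoding<T> forSchema(String schema);

    abstract String describe();
}

// Hypothetical encoder that records which schema it was built for.
class TagEncoding extends NullSafeEncoding<String> {
    private final String schema;

    TagEncoding(String schema) {
        this.schema = schema == null ? "default" : schema;
    }

    @Override
    NullSafeEncoding<String> forSchema(String schema) {
        return new TagEncoding(schema);
    }

    @Override
    String describe() {
        return "schema=" + schema;
    }
}

public class NullSchemaDemo {
    public static void main(String[] args) {
        NullSafeEncoding<String> enc = new TagEncoding(null);
        System.out.println(enc.withSchema(null).describe());  // schema=default
        System.out.println(enc.withSchema("v2").describe());  // schema=v2
    }
}
```

With this shape, subclasses only ever see non-null schemas, so the null case is handled once in the framework rather than in every encoder.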

4. Missing Test Cases

While test coverage is good, consider adding:

  1. Null schema handling test: What happens when schema is null?
  2. Schema with unions: Does this work correctly with union types?
  3. Recursive schemas: Classes that reference themselves
  4. Error handling: What if withSchema throws an exception?
  5. Concurrent access: Multiple threads reading with different schemas

🔒 Security

No security concerns identified. The changes don't introduce any new attack vectors:

  • No user input is processed differently
  • No new reflection capabilities beyond existing patterns
  • Schema objects come from trusted sources (compile-time or serialized data)

Performance

Positive:

  • Schema information enables more efficient encoding/decoding strategies
  • Cached accessors (via WeakHashMap) prevent repeated reflection

Concerns:

  1. Memory: As noted above, creating complete accessor sets for each schema could be expensive
  2. WeakHashMap overhead: Weak references have GC overhead; consider profiling with many schemas
  3. No lazy initialization: buildByName is called eagerly even if custom encoders aren't used

Recommendation: Profile memory usage with workloads involving:

  • Many different schemas for the same class
  • Classes with many fields
  • High-throughput scenarios

📝 Code Quality

Good:

  • Clean separation of concerns
  • Follows existing code patterns
  • Good use of modern Java (var, method references)
  • Proper error handling

Minor Issues:

  1. Variable naming: byNameSchema (line 393) is slightly ambiguous - consider accessorsByName
  2. Code duplication: ReflectData.populateEncoderCache and FieldAccessReflect.getAccessor have similar patterns for instantiating custom encoders - could be extracted to a utility method
  3. Missing @since tags: new public API methods should document the version in which they were added

🧪 Test Coverage Recommendation

Add a test for schema evolution edge case:

@Test
void testSchemaEvolutionWithFieldRemoval() throws IOException {
    // Write with 3 fields
    var written = writeWithSchema(new R1("1", "2", "3"), SCHEMA3);
    
    // Read with 2 fields - removed field should be skipped gracefully
    var read = readWithSchema(written, SCHEMA2);
    
    assertEquals("1", read.getValue());
    assertEquals("2", read.getName());
    assertNull(read.getName2());
}

Recommended Actions

Before merging:

  1. Address the memory/performance concern in createAccessorsFor
  2. Add thread-safety documentation to CustomEncoding.withSchema
  3. Add null schema handling documentation or guards

Future improvements:

  1. Add test cases for edge cases mentioned above
  2. Profile memory usage with realistic workloads
  3. Consider adding @since Javadoc tags

📊 Overall Assessment

Score: 7.5/10

This is a well-implemented feature that solves a real problem with backward compatibility. The main concerns are around memory efficiency and documentation. With the recommended changes, this would be production-ready.

Approval recommendation: ✅ Approve with minor changes requested


Great work on this feature! The schema evolution support will be valuable for users dealing with evolving data formats.


@augmentcode augmentcode bot left a comment


Review completed. 1 suggestion posted.
Comment augment review to trigger a new review.

* opportunity to detect schema changes and behave accordingly. Useful for
* maintaining backwards compatibility.
*
* @param schema the schema detected during read.

The Javadoc says "schema detected during read", but withSchema is also called on write paths (e.g., ReflectData.populateEncoderCache and in field access for writing), so this phrasing can be misleading. Consider generalizing to "active schema (read or write)" to reflect actual usage.


@martin-augment

value:useful; category:documentation; feedback: The Augment AI reviewer is correct that the javadoc should be updated so it does not single out the read schema; the method is used on both read and write paths.

@github-actions

github-actions bot commented Nov 7, 2025

  • Blocking – Security regression: lang/java/avro/src/main/java/org/apache/avro/util/ClassUtils.java:95 now loads classes with Class.forName(..., true, ...) and no longer calls ClassSecurityValidator.validate(...). Every site that previously relied on that guard (SpecificData, ReflectData, Thrift/Protobuf integration, etc.) will now happily load and initialize whatever class name an Avro schema provides. That re‑opens the deserialization gadget surface the validator was introduced to close.

  • Blocking – Behaviour regression & reduced hardening: lang/java/avro/src/main/java/org/apache/avro/specific/SpecificDatumReader.java:48 drops the built‑in trusted class/package sets and replaces them with raw system‑property splits plus a startsWith check. In practice: (1) commonly used safe classes such as java.lang.String, java.math.BigDecimal, etc. are no longer trusted by default, so existing schemas that rely on java-class immediately start throwing SecurityException; (2) the new prefix test allows com.safe + com.safeEvil.Malicious to match, which weakens the allowlist. Both are backward‑incompatible and reduce the protection the old validator provided.

  • Blocking – Public API removal: lang/java/avro/src/main/java/org/apache/avro/util/ClassSecurityValidator.java has been deleted entirely. That class (including the builder/composite predicates) is part of the published API since 1.11.x and is used by downstream consumers to customise trust policies. Removing it will break source and binary compatibility for any existing client code.

@martin-augment

  • Blocking – Security regression: lang/java/avro/src/main/java/org/apache/avro/util/ClassUtils.java:95 now loads classes with Class.forName(..., true, ...) and no longer calls ClassSecurityValidator.validate(...). Every site that previously relied on that guard (SpecificData, ReflectData, Thrift/Protobuf integration, etc.) will now happily load and initialize whatever class name an Avro schema provides. That re‑opens the deserialization gadget surface the validator was introduced to close.
  • Blocking – Behaviour regression & reduced hardening: lang/java/avro/src/main/java/org/apache/avro/specific/SpecificDatumReader.java:48 drops the built‑in trusted class/package sets and replaces them with raw system‑property splits plus a startsWith check. In practice: (1) commonly used safe classes such as java.lang.String, java.math.BigDecimal, etc. are no longer trusted by default, so existing schemas that rely on java-class immediately start throwing SecurityException; (2) the new prefix test allows com.safe + com.safeEvil.Malicious to match, which weakens the allowlist. Both are backward‑incompatible and reduce the protection the old validator provided.
  • Blocking – Public API removal: lang/java/avro/src/main/java/org/apache/avro/util/ClassSecurityValidator.java has been deleted entirely. That class (including the builder/composite predicates) is part of the published API since 1.11.x and is used by downstream consumers to customise trust policies. Removing it will break source and binary compatibility for any existing client code.

value:annoying; category:bug; feedback: The Codex AI reviewer talks about classes that are not changed in this pull request.

@martin-augment

1. Potential Memory/Performance Issue in ClassAccessorData.createAccessorsFor

Location: ReflectData.java:391-399

private FieldAccessor[] createAccessorsFor(Schema schema) {
    var byNameSchema = buildByName(clazz, schema);  // Creates NEW map of accessors
    List<Schema.Field> avroFields = schema.getFields();
    FieldAccessor[] result = new FieldAccessor[avroFields.size()];
    for (Schema.Field avroField : schema.getFields()) {
        result[avroField.pos()] = byNameSchema.get(avroField.name());
    }
    return result;
}

Problem: Every time createAccessorsFor is called for a new schema, buildByName creates a complete new set of FieldAccessor instances for all fields in the class, even if only one field has a custom encoder that needs schema information.

Impact:

  • Memory overhead: For classes with many fields, this creates N×M accessors (N fields × M different schemas)
  • Performance: Unnecessary object allocation and reflection calls
  • Cache pollution: The WeakHashMap<Schema, FieldAccessor[]> in bySchema will accumulate entries for each unique schema instance

value:useful; category:bug; feedback: The Claude AI reviewer is correct that buildByName(clazz, schema) is a heavy operation whose results should be cached and reused, avoiding repeated reflective re-creation of the map.

@martin-augment

2. Thread Safety Concern in Custom Encoder Implementations

Location: CustomEncoding.java:59-61

public CustomEncoding<T> withSchema(Schema schema) {
    return this;
}

Problem: The test implementations (R1Encoder, R1EncoderDouble) mutate the schema field:

@Override
public CustomEncoding<R1> withSchema(Schema schema) {
    var encoder = new R1Encoder();
    encoder.schema = schema;  // Direct field mutation
    return encoder;
}

While the test creates new instances (good!), the Javadoc doesn't warn users about thread-safety requirements.

value:useful; category:bug; feedback: The Claude AI reviewer is correct that the new method could easily lead to concurrency issues if the override just does something like this.schema = schema. The method should return a new instance of the CustomEncoding class that uses the provided schema. The finding prevents concurrency/thread-safety issues that would be hard to debug.
